Data Visualization with Python¶

Priyanga D. Talagala

IASSL Workshop on Data visualization with R and Python

27/5/2023

palmerpenguins data¶

  • The Palmer penguins dataset by Allison Horst, Alison Hill, and Kristen Gorman was first made publicly available as an R package.
  • Using the palmerpenguins Python package, you can easily load the Palmer penguins data into your Python environment.
In [1]:
# import sys
# !{sys.executable} -m pip install palmerpenguins
from palmerpenguins import load_penguins

penguins = load_penguins()

penguins.head()
Out[1]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
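
The preview above contains a row of missing values (row 3), so it can be worth counting them before plotting. The following is a minimal sketch: it rebuilds only the five rows shown in Out[1] by hand so that it runs without the palmerpenguins package, and the names `preview`, `missing_per_column` and `complete` are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# The five rows displayed in Out[1], rebuilt by hand so the snippet is self-contained
preview = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, np.nan, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    "sex": ["male", "female", "female", np.nan, "female"],
    "year": [2007] * 5,
})

# Number of missing values in each column
missing_per_column = preview.isna().sum()
print(missing_per_column)

# Keep only the rows where both measurements needed for a scatter plot are present
complete = preview.dropna(subset=["flipper_length_mm", "body_mass_g"])
print(len(complete), "of", len(preview), "rows are complete")
```

With the full dataset, the same two calls can be applied to `penguins` directly.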

Data Visualization with Python¶

There are several popular plotting packages available in Python.

Here are some of them:

1. Plotnine¶

  • Plotnine is a Python implementation of the grammar of graphics inspired by ggplot2.
  • It aims to provide a similar syntax and functionality to ggplot2 in R, allowing users to create elegant and customizable plots.
  • Plotnine is built on top of the matplotlib library, which is a widely used plotting library in Python.
In [7]:
## This code raises a SyntaxError: outside brackets, a line cannot end with a trailing +
ggplot(penguins, aes(x='species')) +
    geom_bar(fill='steelblue') + 
    labs(x='Species', y='Count', title='Number of Penguins by Species')
  Cell In[7], line 2
    ggplot(penguins, aes(x='species')) +
                                        ^
SyntaxError: invalid syntax
In [2]:
# import sys
# !{sys.executable} -m pip install plotnine
from plotnine import *

ggplot(penguins, aes(x='species')) + geom_bar(fill='steelblue') + labs(x='Species', y='Count', title='Number of Penguins by Species')
Out[2]:
<Figure Size: (640 x 480)>
  • In Python, the backslash \ is used as a line continuation character.
  • We can use it to split the code across multiple lines.
  • It is not necessary to use a backslash if the code fits on a single line.
In [3]:
ggplot(penguins, aes(x='species')) + \
    geom_bar(fill='steelblue') + \
    labs(x='Species', y='Count', title='Number of Penguins by Species')
Out[3]:
<Figure Size: (640 x 480)>
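
Parentheses are an alternative to the backslash: Python keeps reading an expression until the closing bracket, so no continuation character is needed. The sketch below uses plain numbers so that it runs without plotnine; the same layout should apply to a `ggplot(...) + geom_bar(...) + labs(...)` chain wrapped in parentheses. The variable names are illustrative only.

```python
# Backslash continuation: every line except the last must end with "\"
total_backslash = 1 + \
    2 + \
    3

# Parenthesized continuation: no backslashes needed,
# and the operator may start each line
total_parens = (
    1
    + 2
    + 3
)

print(total_backslash, total_parens)
```

Many Python style guides prefer the parenthesized form because a stray space after a backslash is an error that is hard to spot.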

2. Matplotlib¶

  • Matplotlib is one of the most widely used plotting libraries in Python.
  • You can think of matplotlib as the basic plotting library in Python, similar to how base R provides basic plotting capabilities.
  • It provides a comprehensive set of functionalities to create various types of visualizations, from basic line plots to complex 3D plots.
  • Matplotlib is widely used in the Python data science community and has a large ecosystem of tools and packages built on top of it.
In [4]:
import matplotlib.pyplot as plt

species_counts = penguins["species"].value_counts()

plt.bar(species_counts.index, species_counts.values)
plt.xlabel("Species")
plt.ylabel("Count")
plt.title("Number of Penguins by Species")

plt.show()

Note:

  • Whether you need to call plt.show() depends on where the code runs.
  • In a Jupyter Notebook with the default inline backend, the figure is displayed automatically when the cell finishes, so plt.show() is optional there.
  • When you run the code as a plain Python script, or in an IDE that does not display plots automatically, include plt.show() at the end to make sure the plot is shown.
  • Calling plt.show() in a notebook is harmless, and it keeps the cell output free of text such as the representation of the last plotting object.
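
When the code runs as a plain script, a common alternative to plt.show() is to write the figure to an image. The sketch below is one way to do that under stated assumptions: the counts are hypothetical toy values, and the figure is saved into an in-memory buffer only so that the example leaves no file behind.

```python
import io

import matplotlib.pyplot as plt

# Hypothetical counts, for illustration only
species = ["A", "B", "C"]
counts = [5, 3, 4]

fig, ax = plt.subplots()
ax.bar(species, counts, color="steelblue")
ax.set_xlabel("Species")
ax.set_ylabel("Count")
ax.set_title("Number of Penguins by Species")

# Write the figure as PNG. A file path such as "species_counts.png"
# could be passed here in place of the in-memory buffer.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")

# Release the figure once it is no longer needed
plt.close(fig)

print(len(buffer.getvalue()), "bytes written")
```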

3. Seaborn¶

  • Just as R has additional packages such as ggplot2 for more advanced and specialized plotting, Python has libraries such as seaborn that build on top of matplotlib to provide higher-level interfaces and additional functionality. (Plotly, covered later, is a separate library that does not depend on matplotlib.)
  • Seaborn is closely integrated with pandas data structures in Python.
  • It offers various built-in themes and color palettes, making it easier to create aesthetically pleasing plots.
In [5]:
## Method 1 using countplot

import seaborn as sns

ax = sns.countplot(data=penguins, x="species")
ax.set_title("Number of Penguins by Species")

plt.show()
In [6]:
sns.countplot(data=penguins, x="species", color="steelblue")
plt.xlabel("Species")
plt.ylabel("Count")
plt.title("Number of Penguins by Species")

plt.show()
In [7]:
## Using bar plot
species_counts = penguins["species"].value_counts()
species_counts

sns.barplot(x=species_counts.index, y=species_counts.values, color="steelblue")
plt.xlabel("Species")
plt.ylabel("Count")
plt.title("Number of Penguins by Species")

plt.show()

The popularity of the libraries¶

  • The popularity of a library is influenced by several factors, including its functionality, ease of use, community support, and historical adoption.

  • While ggplot2 is a popular and highly regarded library in the R programming language, the adoption of ggplot-style libraries such as plotnine in Python has been relatively limited.

  • There are a few reasons for this:

  1. Familiarity:

    • matplotlib has been around for a long time and is deeply ingrained in the Python data science ecosystem.
    • Many users are already familiar with its syntax and have invested time in learning and using it.
    • Similarly, seaborn has gained popularity for its easy-to-use interface and attractive default styles. As a result, these libraries have a larger user base and community support.
  2. Compatibility:

    • matplotlib is a fundamental plotting library in Python and is widely used across different domains.
    • It integrates well with other libraries and frameworks, making it a reliable choice for many users.
    • seaborn builds on top of matplotlib and provides additional statistical plotting capabilities, making it a natural choice for users who want to create visually appealing and informative plots.
  3. Active Community and Documentation:

    • The popularity of a library is often influenced by the size and activity of its community.
    • matplotlib and seaborn have been around for a long time, and as a result, they have a large user community, extensive documentation, and numerous examples and tutorials available.
    • This makes it easier for new users to get started and find solutions to their problems.

Data Visualization With Seaborn¶

Different categories of plots in Seaborn¶

Seaborn divides its plots into the following categories based on the types of relationships between variables:

  1. Relational Plots: These plots are used to visualize the relationship between two numeric variables. Some examples include scatter plots (scatterplot()), line plots (lineplot()), and the figure-level interface that draws either of them on a grid of facets (relplot()).

  2. Categorical Plots: These plots are used to show the relationship between one numeric variable and one categorical variable. They can help visualize distributions, comparisons, and aggregations across categories. Some examples include bar plots (barplot()), count plots (countplot()), box plots (boxplot()), and violin plots (violinplot()).

  3. Distribution Plots: These plots are used to visualize the distribution of a single variable or the joint distribution of two variables. They help in understanding the underlying distribution and identifying patterns. Some examples include histograms (histplot()), kernel density estimation plots (kdeplot()), and empirical cumulative distribution plots (ecdfplot()).

  4. Regression Plots: These plots are used to visualize the relationship between two numeric variables and fit regression models to the data. They help in understanding the linear or non-linear relationship between variables. Some examples include scatter plots with regression lines (regplot()), residual plots (residplot()), and regression plots drawn across a grid of facets (lmplot()).

  5. Matrix Plots: These plots are used to display the relationships between multiple variables as a matrix. They are particularly useful for visualizing correlation matrices and covariance matrices. Some examples include heatmap plots (heatmap()) and clustermap plots (clustermap()).

These categories provide a comprehensive set of plotting options in Seaborn to address various types of relationships and data structures.

For more plotting options, visit the Python Graph Gallery.

Anatomy of a chart¶

  • Seaborn is a high-level library built on top of Matplotlib, which means that many of the concepts and vocabulary used in Matplotlib are still applicable when working with Seaborn.
  • Understanding the anatomy of a Matplotlib chart can help you better comprehend and navigate Seaborn plots.

[Figure: anatomy.png, the anatomy of a Matplotlib chart]

Source: https://www.python-graph-gallery.com/seaborn/
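
To connect the vocabulary to code, the sketch below builds a small chart with Matplotlib's object-oriented interface and touches the main parts one by one. The measurements are hypothetical toy values and the labels are chosen only for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical measurements, for illustration only
flipper_length_mm = [181, 186, 195, 210, 217]
body_mass_g = [3750, 3800, 3250, 4500, 5200]

# Figure: the whole canvas. Axes: one plotting area inside the figure.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(flipper_length_mm, body_mass_g, label="toy penguins")

ax.set_title("Body mass vs flipper length")   # title of this Axes
ax.set_xlabel("Flipper length (mm)")          # x-axis label
ax.set_ylabel("Body mass (g)")                # y-axis label
ax.set_xticks([180, 200, 220])                # major tick positions on the x-axis
ax.grid(True)                                 # grid lines
ax.legend(title="Data")                       # legend with its own title
fig.suptitle("Anatomy of a chart")            # title of the whole figure

print(ax.get_title())
```

Seaborn's axes-level functions such as scatterplot() return the same kind of Axes object, so these methods can be used to adjust a Seaborn plot as well.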

The basic steps to creating plots with Seaborn¶

  1. Prepare some data

  2. Control figure aesthetics

  3. Plot with Seaborn

  4. Further customize your plot

  5. Show your plot
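
The five steps can be sketched in a single cell as follows. This is only an illustration: the `toy` DataFrame holds hypothetical values rather than the real penguins data, and the labels and theme are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# 1. Prepare some data (hypothetical toy values, not the real penguins data)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "flipper_length_mm": [181, 190, 215, 221, 195, 201],
    "body_mass_g": [3750, 3650, 5000, 5400, 3700, 3900],
})

# 2. Control figure aesthetics
sns.set_theme(style="whitegrid")

# 3. Plot with Seaborn
ax = sns.scatterplot(data=toy, x="flipper_length_mm", y="body_mass_g", hue="species")

# 4. Further customize your plot
ax.set_xlabel("Flipper length (mm)")
ax.set_ylabel("Body mass (g)")
ax.set_title("Body mass vs flipper length")

# 5. Show your plot
plt.show()
```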

There are two ways to create a plot using seaborn:

  1. The first way (the recommended one) is to pass your DataFrame to the data= argument and pass column names to the axes arguments, x= and y=.
  2. The second way is to pass Series of data directly to the axes arguments.

Here's an example of creating a bar plot using the two different methods in Seaborn:

In [8]:
# Method 1: Pass DataFrame to data= argument and column name to x=
sns.countplot(data = penguins, x = 'species')
plt.show()
In [9]:
# Method 2: Pass Series of data to x= argument
sns.countplot(x = penguins['species'])
plt.show()
In [10]:
# Create a scatter plot using Seaborn
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g')
plt.show()

Mapping Colours¶

In [11]:
# Create a scatter plot with color and shape aesthetics using Seaborn
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species', style='species')
plt.show()
In [12]:
# Create a scatter plot with conditional coloring using Seaborn
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue=penguins['flipper_length_mm'] < 205)
plt.show()

Setting Colours¶

In [13]:
# Create a scatter plot with purple color using Seaborn
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', color='purple')
plt.show()

Adding Layers¶

In [14]:
# Create a scatter plot with color and shape based on species using Seaborn
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species', style='species')

# Add a 2D density plot
sns.kdeplot(data=penguins, x='flipper_length_mm', y='body_mass_g', color='black', fill=True, alpha=0.3)

# Show the plot
plt.show()

Scales¶

In [15]:
sns.scatterplot(
    data=penguins,  # Specify the DataFrame
    x='flipper_length_mm',  # Specify the x-axis variable
    y='body_mass_g',  # Specify the y-axis variable
    hue='species',  # Color the points based on species
    style='island'  # Use different point shapes based on island
)
plt.show()

Scales manual¶

In [16]:
# Define the color palette
cols = {"Adelie": "red", "Chinstrap": "blue", "Gentoo": "darkgreen"}

# Create a scatter plot
sns.scatterplot(
    data=penguins,  # Specify the DataFrame
    x='flipper_length_mm',  # Specify the x-axis variable
    y='body_mass_g',  # Specify the y-axis variable
    hue='species',  # Color the points based on species
    palette=cols  # Use the defined color palette
)


# Display the plot
plt.show()
In [17]:
sns.scatterplot(
    data=penguins,
    x='flipper_length_mm',
    y='body_mass_g',
    hue='bill_length_mm',
    style='island'
)

# Add any additional customization or annotations here

plt.show()
In [18]:
# Create a scatter plot using seaborn
sns.scatterplot(
    data=penguins,  # Specify the data
    x='flipper_length_mm',  # Set the x-axis variable
    y='body_mass_g',  # Set the y-axis variable
    hue='species',  # Set the variable for color
    palette='Dark2'  # Set the color palette to 'Dark2'
)

plt.show()  # Display the plot
In [19]:
# brewer2mpl is a separate third-party package (its maintained successor is palettable)
import brewer2mpl

# Print all available color palettes
brewer2mpl.print_maps()
Sequential
Blues     :  {3, 4, 5, 6, 7, 8, 9}
BuGn      :  {3, 4, 5, 6, 7, 8, 9}
BuPu      :  {3, 4, 5, 6, 7, 8, 9}
GnBu      :  {3, 4, 5, 6, 7, 8, 9}
Greens    :  {3, 4, 5, 6, 7, 8, 9}
Greys     :  {3, 4, 5, 6, 7, 8, 9}
OrRd      :  {3, 4, 5, 6, 7, 8, 9}
Oranges   :  {3, 4, 5, 6, 7, 8, 9}
PuBu      :  {3, 4, 5, 6, 7, 8, 9}
PuBuGn    :  {3, 4, 5, 6, 7, 8, 9}
PuRd      :  {3, 4, 5, 6, 7, 8, 9}
Purples   :  {3, 4, 5, 6, 7, 8, 9}
RdPu      :  {3, 4, 5, 6, 7, 8, 9}
Reds      :  {3, 4, 5, 6, 7, 8, 9}
YlGn      :  {3, 4, 5, 6, 7, 8, 9}
YlGnBu    :  {3, 4, 5, 6, 7, 8, 9}
YlOrBr    :  {3, 4, 5, 6, 7, 8, 9}
YlOrRd    :  {3, 4, 5, 6, 7, 8, 9}
Diverging
BrBG      :  {3, 4, 5, 6, 7, 8, 9, 10, 11}
PRGn      :  {3, 4, 5, 6, 7, 8, 9, 10, 11}
PiYG      :  {3, 4, 5, 6, 7, 8, 9, 10, 11}
PuOr      :  {3, 4, 5, 6, 7, 8, 9, 10, 11}
RdBu      :  {3, 4, 5, 6, 7, 8, 9, 10, 11}
RdGy      :  {3, 4, 5, 6, 7, 8, 9, 10, 11}
RdYlBu    :  {3, 4, 5, 6, 7, 8, 9, 10, 11}
RdYlGn    :  {3, 4, 5, 6, 7, 8, 9, 10, 11}
Spectral  :  {3, 4, 5, 6, 7, 8, 9, 10, 11}
Qualitative
Accent    :  {3, 4, 5, 6, 7, 8}
Dark2     :  {3, 4, 5, 6, 7, 8}
Paired    :  {3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
Pastel1   :  {3, 4, 5, 6, 7, 8, 9}
Pastel2   :  {3, 4, 5, 6, 7, 8}
Set1      :  {3, 4, 5, 6, 7, 8, 9}
Set2      :  {3, 4, 5, 6, 7, 8}
Set3      :  {3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
In [20]:
# Create scatter plot using seaborn and viridis
sns.scatterplot(
    data=penguins,              # Data frame containing the data
    x='flipper_length_mm',     # Variable for the x-axis
    y='body_mass_g',           # Variable for the y-axis
    hue='species',             # Variable to differentiate the colors
    palette='viridis'          # Color map to use for the plot
)

# Display the plot
plt.show()
  • viridis and ColorBrewer (available in R through the RColorBrewer package) provide color scales, many of which remain readable for people with color-vision deficiency.
  • For details and an interactive palette selection tool, see http://colorbrewer.org
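
To see which colors a palette name stands for before using it in a plot, the palette can be inspected directly. The sketch below assumes only seaborn; the choice of three colors matches the three penguin species, and the variable names are illustrative.

```python
import seaborn as sns

# Three colors from the qualitative ColorBrewer palette "Dark2"
dark2 = sns.color_palette("Dark2", 3)
print(dark2.as_hex())

# Three evenly spaced colors sampled from the viridis color map
viridis = sns.color_palette("viridis", 3)
print(viridis.as_hex())
```

In a notebook, leaving a palette object as the last expression of a cell displays it as a row of color swatches.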
In [36]:
# Create scatter plot using Seaborn
sns.scatterplot(
    data=penguins,
    x='flipper_length_mm',
    y='body_mass_g',
    hue='species',
    style='species',
    palette='plasma',
    markers={'Adelie': 's', 'Gentoo': 'o', 'Chinstrap': '^'},
)

# Set x-axis breaks
plt.xticks([170, 200, 230])

# Set y-axis to logarithmic scale
plt.yscale('log')

# Display the plot
plt.show()

Facets¶

In [22]:
# Create facets for each species
g = sns.FacetGrid(penguins, col='species')
g.map(sns.scatterplot, 'flipper_length_mm', 'body_mass_g')

# Display the plot
plt.show()
In [23]:
# Create a wrapped grid of facets (the counterpart of facet_wrap in ggplot2)
# The col_wrap parameter in the FacetGrid function specifies the maximum number of columns in the grid of facets
g = sns.FacetGrid(penguins, col='species', col_wrap=3, sharex=False, sharey=False)
g.map(sns.scatterplot, 'flipper_length_mm', 'body_mass_g')

# Display the plot
plt.show()
In [24]:
g = sns.FacetGrid(penguins, col='species', row='sex')
g.map(sns.scatterplot, 'flipper_length_mm', 'body_mass_g')
# Display the plot
plt.show()
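
As an alternative to building a FacetGrid and calling map(), the figure-level function relplot() creates the grid and draws the scatter plots in one call. The following is a minimal sketch with a hypothetical `toy` DataFrame standing in for the penguins data; the axis labels and panel size are arbitrary.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy values, for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "flipper_length_mm": [181, 190, 215, 221, 195, 201],
    "body_mass_g": [3750, 3650, 5000, 5400, 3700, 3900],
})

# One panel per species; relplot() returns the FacetGrid it created
g = sns.relplot(
    data=toy,
    x="flipper_length_mm",
    y="body_mass_g",
    col="species",
    kind="scatter",
    height=3,
)
g.set_axis_labels("Flipper length (mm)", "Body mass (g)")
g.set_titles(col_template="{col_name}")

print(g.axes.shape)
```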

Coordinates¶

In [25]:
sns.countplot(data=penguins, x='species')
plt.show()
In [26]:
# Coordinate flip: map species to the y-axis to draw horizontal bars
sns.countplot(data=penguins, y='species', palette='Set1')
plt.show()

Themes¶

Built-in themes¶

  • Seaborn provides several built-in themes that you can use to customize the appearance of your plots.

  • These are complete themes which control all non-data display.

  • Here are some of the available themes:

    • "darkgrid": Dark background with gridlines.

    • "whitegrid": White background with gridlines.

    • "dark": Dark background without gridlines.

    • "white": White background without gridlines.

    • "ticks": White background with tick marks.

In [27]:
# Set the theme
sns.set_theme(style="darkgrid")

# Create the scatter plot
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species', style='species')

# Display the plot
plt.show()
In [28]:
# Set the theme
sns.set_theme(style="whitegrid")

# Create the scatter plot
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species', style='species')

# Display the plot
plt.show()
In [29]:
# Set the theme
sns.set_theme(style="dark")

# Create the scatter plot
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species', style='species')

# Display the plot
plt.show()
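
Note that sns.set_theme() changes the defaults for every plot drawn afterwards. To apply a style to a single plot only, one option is to use sns.axes_style() as a context manager. The sketch below uses a hypothetical `toy` DataFrame, and the `grid_*` variables exist only to show that the setting is restored after the block.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy values, for illustration only
toy = pd.DataFrame({
    "flipper_length_mm": [181, 190, 215, 221, 195, 201],
    "body_mass_g": [3750, 3650, 5000, 5400, 3700, 3900],
})

grid_before = plt.rcParams["axes.grid"]

# The style applies only to axes created inside the "with" block
with sns.axes_style("whitegrid"):
    grid_inside = plt.rcParams["axes.grid"]
    fig, ax = plt.subplots()
    sns.scatterplot(data=toy, x="flipper_length_mm", y="body_mass_g", ax=ax)

grid_after = plt.rcParams["axes.grid"]

print(grid_before, grid_inside, grid_after)
```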
In [30]:
# Create the figure and subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# Scatter plot
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species', ax=axs[0, 0])
# Box plot
sns.boxplot(data=penguins, x='species', y='bill_depth_mm', ax=axs[0, 1])
# Histogram with KDE
sns.histplot(data=penguins, x='body_mass_g', hue='species', element='step', kde=True, ax=axs[1, 0])

# Remove empty subplot
fig.delaxes(axs[1, 1])

# Adjust the layout
plt.tight_layout()

# Add a plot title for the entire panel
fig.suptitle("Penguin Data Analysis", fontsize=16, y=1)


# Show the plot
plt.show()
In [31]:
# Create the figure and grid layout
fig = plt.figure(figsize=(10, 8))  # Create a new figure object with size 10x8 inches
gs = fig.add_gridspec(2, 2)  # Add a grid specification with 2 rows and 2 columns

# Scatter plot
ax1 = fig.add_subplot(gs[0, 0])
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species', ax=ax1)
ax1.text(0.05, 0.9, 'A', transform=ax1.transAxes, fontsize=16, fontweight='bold')

# Modify the legend title and position
ax1.legend_.set_title("Penguin Species")
ax1.legend_.set_bbox_to_anchor((1.1, 1.0))  # Adjust the bbox_to_anchor value to change the legend position

# Box plot
ax2 = fig.add_subplot(gs[0, 1])
sns.boxplot(data=penguins, x='species', y='bill_depth_mm', ax=ax2)
ax2.text(0.05, 0.9, 'B', transform=ax2.transAxes, fontsize=16, fontweight='bold')

# Histogram with KDE
ax3 = fig.add_subplot(gs[1, :]) # The colon to indicate that I want to select all columns in this row
sns.histplot(data=penguins, x='body_mass_g', hue='species', element='step', kde=True, ax=ax3)
ax3.text(0.05, 0.9, 'C', transform=ax3.transAxes, fontsize=16, fontweight='bold')

# Adjust the layout
fig.tight_layout()

# Add the plot annotation and adjust the position
fig.suptitle('Size measurements for adult foraging penguins near Palmer Station, Antarctica', fontsize=16, y=1)
#fig.text(0.05, 0.05, 'A', fontsize=16, fontweight='bold')

# Show the plot
plt.show()

Plotly¶

  • In R, the plotly::ggplotly() function is primarily used to convert ggplot objects to interactive Plotly plots.
  • To create interactive plots in Python, you can use the plotly.express module from Plotly.
  • This module provides a simplified interface for creating interactive visualizations.
  • If you're working with Plotly Express, you can directly utilize its API to create interactive plots without relying on Seaborn or its functionalities.
In [40]:
import plotly.express as px

# Load the penguins dataset through seaborn (fetched online on first use)
penguins = sns.load_dataset('penguins')

# Static version: a scatter plot drawn with seaborn, for comparison
scatterplot = sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species')

# Interactive version: Plotly Express builds its own figure directly from the DataFrame
# (it does not convert the seaborn plot)
scatterplot_plotly = px.scatter(penguins, x='flipper_length_mm', y='body_mass_g', color='species')

# Display the interactive plot
scatterplot_plotly.show()